Star Hotels Project

by Narges Shahmohammadi

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans, scheduling conflicts, etc. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. Star Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Exploratory Data Analysis (EDA)

Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Univariate Data Analysis

The number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel. Stays of 2, 1, and then 3 weeknights are the most common.

The number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel. Most guests do not book or stay at the hotel on weekends.

The plot of the type of meal plan booked by the customer shows that most customers booked breakfast only (Meal Plan 1).

Only a few customers require a car parking space.

The bar plot of the type of room reserved by the customer shows that Room Type 1 is the most popular.

The number of days between the booking date and the guest's arrival date (lead time) is highly skewed to the right. Guests mostly book hotel rooms between 1 and 100 days in advance.

Most arrivals are in 2018.

Most reservations are for April through August. The first and the last two months of the year have the fewest reservations.

Guests seem to book relatively few rooms for the last day of the month.

The Online market segment is the largest, with 82% of bookings.

The Offline segment is second, with 12%.

Only 3.41% of the guests are repeating guests.

Almost all of the guests (99%) have no previous bookings that were canceled.

The distribution of avg_price_per_room is skewed to the right but otherwise looks roughly normal. Its mean is around 109 euros.

Multivariate Data Analysis

It seems that arrival_year and arrival_month are correlated.

The relationship between booking status and the number of special requests differs by year. Overall, bookings that were not canceled and had more special requests are most common in 2019, while canceled bookings with many special requests are most common in 2017.

The longer the lead time, the more likely the booking is to be canceled.

Guests with more children canceled their reservations more often than others.

It seems that the number of adults has no effect on cancellation.

The price per room increases mildly from 2017 to 2019.

The Questions

What are the busiest months in the hotel?

August and July are the busiest months.

Which market segment do most of the guests come from?

As shown earlier, most of the guests come from the Online segment.

Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

The Online and Complementary market segments have the highest and lowest average prices, respectively.

What percentage of bookings are canceled?

33.3% of bookings were canceled.

Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

Less than 1% of repeating guests cancel their reservations.
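
Both percentages above (the overall cancellation share and the share among repeating guests) are simple group proportions. A minimal sketch in plain Python, using a few made-up records; the field names `repeated_guest` and `booking_status` and the status labels are assumptions modeled on the column names used in this report, not the actual data:

```python
# Hypothetical bookings; the values are made up for illustration only.
bookings = [
    {"repeated_guest": 0, "booking_status": "Canceled"},
    {"repeated_guest": 0, "booking_status": "Not_Canceled"},
    {"repeated_guest": 0, "booking_status": "Canceled"},
    {"repeated_guest": 1, "booking_status": "Not_Canceled"},
    {"repeated_guest": 1, "booking_status": "Not_Canceled"},
    {"repeated_guest": 0, "booking_status": "Not_Canceled"},
]


def cancellation_rate(rows):
    """Return the percentage of rows whose booking was canceled."""
    if not rows:
        return 0.0
    canceled = sum(1 for r in rows if r["booking_status"] == "Canceled")
    return 100.0 * canceled / len(rows)


overall = cancellation_rate(bookings)
repeat_only = cancellation_rate([r for r in bookings if r["repeated_guest"] == 1])
print(f"overall: {overall:.1f}%, repeating guests: {repeat_only:.1f}%")
```

The same helper can be applied to any other subset, for example bookings grouped by number of special requests.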

Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

It seems that guests with many special requests tend not to cancel their bookings.

Data Preprocessing

Creating training and test sets.
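
A minimal sketch of a random train/test split in plain Python. The 70/30 ratio, the seed, and the helper name `split_rows` are assumptions for illustration; the report does not state the settings that were actually used.

```python
import random


def split_rows(rows, test_fraction=0.3, seed=1):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for 10 booking records
train, test = split_rows(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split identical between runs, which makes the training and test metrics reported later comparable.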

EDA

Bookings with 2 guests are the most common.

There is no change here. Bookings with no children are still the most common.

After dealing with outliers, stays of 2, 1, and then 3 weeknights are still the most common.

As before, most guests do not book or stay at the hotel on weekends.

The plot of the type of meal plan booked by the customer shows that most customers still booked breakfast only (Meal Plan 1).

There are no changes in room_type_reserved.

The lead_time column is still skewed to the right.

There are no changes. Still, 3.41% of the guests are repeating guests.

After dealing with outliers, the distribution is less skewed, and its mean is still around 113.154 euros.

Most customers had no special requests.

Multivariate Data Analysis

There is no difference in cancellation across room prices.

I can't see any differences here compared with before removing the outliers.

The longer the lead time, the more likely the booking is to be canceled.

The Online market segment is still the largest.

The Online and Complementary market segments have the highest and lowest average prices, respectively.

33.3% of bookings have been canceled.

Almost none of the repeating guests cancel their reservations (0.1%).

It seems that guests with many special requests tend not to cancel their bookings.

Checking Multicollinearity

arrival_year, market_segment_type_Online, room_type_reserved_Room_Type 1, and some other columns exhibit high multicollinearity. I will remove arrival_year first to see which variable has a significant impact on the model's performance.

Next, I drop market_segment_type_Online to see which variable has a significant impact on the model's performance.

Now room_type_reserved_Room_Type 1 has a high VIF value, so I remove it.

Now no_of_adults 1 has a high VIF value (18.137).

avg_price_per_room has a high VIF value (11.393).

type_of_meal_plan_Meal Plan 1 has a VIF value above 5 (7.078), so I will remove it.
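
For reference, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors; equivalently, it is the j-th diagonal entry of the inverse of the predictors' correlation matrix. The sketch below uses that second form in plain Python. The column values are made up, and the helper names `pearson`, `invert`, and `vif` are illustrative, not the code used in this report.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting.

    Fails with ZeroDivisionError if the matrix is singular
    (for example, when two columns are perfectly collinear).
    """
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {name: VIF} from the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical predictor values, for illustration only.
scores = vif({
    "lead_time": [10, 50, 90, 130, 170, 210],
    "avg_price_per_room": [80, 95, 90, 120, 110, 130],
    "no_of_special_requests": [0, 2, 1, 0, 1, 2],
})
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

A common workflow, and the one followed above, is to drop the column with the highest VIF, recompute, and repeat until every remaining value is below the chosen cutoff (5 in this report).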

Building a Logistic Regression model

Logistic Regression (with statsmodels library)

Variables with a high p-value are not statistically significant, so we can drop them.

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train7 as the final ones and lg2 as the final model.

Observations

The coefficients of market_segment_type_Offline, required_car_parking_space, arrival_year, arrival_month, no_of_previous_cancellations, and no_of_special_requests are negative: an increase in these will decrease the chances of a guest canceling the reservation.

The coefficients of no_of_weekend_nights, no_of_week_nights, lead_time, room_type_reserved_Room_Type 4 & 1, avg_price_per_room, type_of_meal_plan_Not Selected, and market_segment_type_Online are positive: an increase in these will increase the chances of a guest canceling the reservation.

Converting coefficients to odds

Coefficient interpretations

Interpretation for other attributes can be done similarly.
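
As a reminder of the arithmetic: a logistic regression coefficient is a change in log-odds, so exp(coefficient) is the odds ratio for a one-unit increase in that variable, and (exp(coefficient) - 1) * 100 is the percentage change in the odds. A short sketch in plain Python; the coefficient values are hypothetical placeholders, not the fitted values from this report:

```python
import math


def odds_summary(beta):
    """Return (odds ratio, percentage change in odds) for a log-odds coefficient."""
    odds_ratio = math.exp(beta)
    return odds_ratio, (odds_ratio - 1) * 100


# Hypothetical coefficients (log-odds); NOT the fitted values from this report.
coefficients = {
    "lead_time": 0.016,
    "no_of_special_requests": -1.5,
}

for name, beta in coefficients.items():
    odds_ratio, pct_change = odds_summary(beta)
    print(f"{name}: odds ratio {odds_ratio:.3f}, change in odds {pct_change:+.1f}%")
```

A positive coefficient gives an odds ratio above 1 (higher odds of cancellation), and a negative one gives an odds ratio below 1.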

Checking model performance on the training set

The confusion matrix

True Positives (TP): we correctly predicted that they CANCEL: 20.11%

True Negatives (TN): we correctly predicted that they do NOT CANCEL: 58.37%

False Positives (FP): we incorrectly predicted that they CANCEL (a "Type I error"): 2213 bookings (8.38%)

False Negatives (FN): we incorrectly predicted that they do NOT CANCEL (a "Type II error"): 3471 bookings (13.14%)
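
With "canceled" as the positive class, the 20.11% of correctly predicted cancellations are the true positives and the 58.37% of correctly predicted non-cancellations are the true negatives. The usual metrics then follow directly from the four shares. A small sketch in plain Python using the training-set percentages reported here (the function name `metrics_from_shares` is only illustrative):

```python
def metrics_from_shares(tp, tn, fp, fn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix cells.

    The cells may be counts or percentages; 'canceled' is the positive class.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Training-set shares (in %) taken from the confusion matrix in this section.
train_metrics = metrics_from_shares(tp=20.11, tn=58.37, fp=8.38, fn=13.14)
for name, value in train_metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the true-positive and false-negative shares add up to 33.25%, which matches the overall cancellation rate of about 33.3% found in the EDA.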

ROC-AUC

The logistic regression model performs well on the training set.

Model Performance Improvement

Let's see if the F1 score can be improved further by changing the model threshold using the AUC-ROC curve.

Optimal threshold using AUC-ROC curve
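
One common way to pick a threshold from the ROC curve is to maximize the difference between the true positive rate and the false positive rate (Youden's J statistic). Whether this report used exactly that criterion is an assumption. The sketch below is plain Python, and the labels and probabilities are made up:

```python
def roc_optimal_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = TPR - FPR (Youden's J).

    y_true: 1 = canceled, 0 = not canceled; y_prob: predicted P(canceled).
    Assumes both classes are present in y_true.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(y_prob)):
        predicted = [p >= threshold for p in y_prob]
        tp = sum(1 for y, hit in zip(y_true, predicted) if y == 1 and hit)
        fp = sum(1 for y, hit in zip(y_true, predicted) if y == 0 and hit)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.70, 0.90]
threshold, j = roc_optimal_threshold(y_true, y_prob)
print(threshold, round(j, 2))
```

Lowering the threshold below 0.5 in this way typically raises recall at the cost of precision.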

Checking model performance on training set

The precision and accuracy of the model have decreased, but the other metrics have increased. The model is still performing well.

Let's use the precision-recall curve and see if we can find a better threshold.

At the threshold of 0.51, we get balanced recall and precision.
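
A balanced threshold of this kind can be located by scanning candidate thresholds and keeping the one where precision and recall are closest. A minimal sketch in plain Python; the data are made up and the function name `balanced_threshold` is hypothetical:

```python
def balanced_threshold(y_true, y_prob):
    """Return (threshold, precision, recall) where |precision - recall| is smallest.

    y_true: 1 = canceled, 0 = not canceled. Assumes at least one positive label.
    """
    best = None
    for threshold in sorted(set(y_prob)):
        predicted = [p >= threshold for p in y_prob]
        tp = sum(1 for y, hit in zip(y_true, predicted) if y == 1 and hit)
        fp = sum(1 for y, hit in zip(y_true, predicted) if y == 0 and hit)
        fn = sum(1 for y, hit in zip(y_true, predicted) if y == 1 and not hit)
        if tp + fp == 0:
            continue  # precision is undefined when nothing is predicted positive
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        gap = abs(precision - recall)
        if best is None or gap < best[0]:
            best = (gap, threshold, precision, recall)
    return best[1:]


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
print(balanced_threshold(y_true, y_prob))
```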

The model performs well on the training set. There is not much improvement in model performance, because the default threshold is 0.50 and here we get 0.51 as the optimal threshold.

Final Model Summary

All the logistic regression models give a generalized performance on the training set. The best recall (0.75) comes from the logistic regression model with the 0.36 threshold.

Let's check the performance on the test set

The confusion matrix

True Positives (TP): we correctly predicted that they CANCEL: 20.14%

True Negatives (TN): we correctly predicted that they do NOT CANCEL: 58.16%

False Positives (FP): we incorrectly predicted that they CANCEL (a "Type I error"): 937 bookings (8.60%)

False Negatives (FN): we incorrectly predicted that they do NOT CANCEL (a "Type II error"): 1484 bookings (13.11%)

ROC curve on test set

The logistic regression model performs well on the test set.

The confusion matrix

True Positives (TP): we correctly predicted that they CANCEL: 24.82%

True Negatives (TN): we correctly predicted that they do NOT CANCEL: 51.01%

False Positives (FP): we incorrectly predicted that they CANCEL (a "Type I error"): 1782 bookings (15.74%)

False Negatives (FN): we incorrectly predicted that they do NOT CANCEL (a "Type II error"): 954 bookings (8.43%)

The precision and accuracy of the model have increased, but the other metrics have decreased. The model is still performing well.

All the logistic regression models give a generalized performance on the training and test sets. The best recall (0.74) comes from the logistic regression model with the 0.36 threshold.

The F1 scores on the training and test sets are comparable.

Building a Decision Tree model

Checking model performance on training set

Checking model performance on test set

According to the decision tree model, lead_time is the most important variable for predicting cancellation.

Do we need to prune the tree?

Yes. The tree is complex, and we should reduce the overfitting.

Reducing overfitting

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

In the tuned decision tree, lead_time is the most important feature, followed by no_of_special_requests.

Cost Complexity Pruning
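
Cost-complexity pruning scores each candidate subtree by its error plus a penalty, alpha, multiplied by its number of leaves, and keeps the subtree with the lowest score. Larger values of alpha therefore favor smaller trees. The toy sketch below shows only that selection rule in plain Python; the candidate subtrees and their error values are made up, and in practice a library routine generates the pruning path.

```python
def best_subtree(subtrees, alpha):
    """Pick the subtree minimizing error + alpha * number of leaves."""
    return min(subtrees, key=lambda t: t["error"] + alpha * t["leaves"])


# Hypothetical pruning candidates: training error falls as the tree grows.
subtrees = [
    {"name": "full", "leaves": 40, "error": 0.02},
    {"name": "medium", "leaves": 12, "error": 0.12},
    {"name": "stump", "leaves": 2, "error": 0.30},
]

for alpha in (0.0, 0.005, 0.05):
    print(alpha, best_subtree(subtrees, alpha)["name"])
```

With alpha = 0 the full tree always wins on training error, which is why the value of alpha is usually chosen by comparing performance on held-out data.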

Model Performance Comparison and Conclusions

Conclusions

We analyzed the "StarHotels Project" data using different techniques and used a Decision Tree Classifier to build a predictive model.

The model can be used to predict whether a customer is going to cancel their booking.

We visualized different trees and their confusion matrices to get a better understanding of the model. Easy interpretation is one of the key benefits of decision trees.

lead_time, no_of_special_requests, and market_segment_type_Offline are the most important variables for predicting which guests will cancel their booking. We established the importance of hyperparameter tuning and pruning to reduce overfitting.

Actionable Insights and Recommendations

According to the decision tree model -

a) If a guest books a room with lead_time less than or equal to 158.30, there is a very high chance that the guest will not cancel the reservation.

b) If a guest books a room with lead_time less than or equal to 158.30 and no_of_special_requests is less than or equal to 1.62, there is a very high chance that the guest will not cancel the reservation.

c) If a guest books a room with lead_time less than or equal to 158.30, no_of_special_requests is less than or equal to 1.62, and market_segment_type_Offline is less than or equal to 0.79, there is a very high chance that the guest will not cancel the reservation.

It is observed that shorter lead times, fewer special requests, and fewer children are associated with bookings that are not canceled. Offering a good deal or service to guests who have many children or many special requests may lower their chance of canceling.

It is observed that repeating guests, who stay in the hotel often and are important to brand equity, cancel only a very small percentage of their bookings (close to zero). So, I think offering new guests a good deal on a second booking can help the hotel gain more repeating guests.

The Online market segment is the largest, with 82% of bookings. I believe that managers should invest more in this segment.